Expose native Lance scan descriptor for datafusion-comet integration #624

Draft
wirybeaver wants to merge 2 commits into lance-format:main from wirybeaver:xuanyili/native-lance-read-descriptor

@wirybeaver wirybeaver commented Jun 12, 2026

Closes #623.

Summary

This adds a stable native-read descriptor for ordinary Lance Spark scans so native engines can consume Spark-planned Lance reads without depending on Lance Spark internals.

The descriptor captures:

  • Dataset URI and resolved dataset version.
  • Spark read schema JSON and projected read schema JSON.
  • Projected columns, pushed filter SQL, limit/offset, batch size, and storage options.
  • Per-partition native splits with Lance fragment IDs.
  • Explicit fallback reasons when a scan cannot be represented by the minimal v1 descriptor.
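
To make the list above concrete, here is a minimal sketch of what such a descriptor could look like as an immutable Java value. Every class and field name here (`NativeScanDescriptorSketch`, `NativeSplitSketch`, `fallbackReasons`, and so on) is hypothetical and chosen for illustration; the PR's actual types may differ, and only a subset of the listed fields is shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical per-partition split: the Lance fragment IDs one task would read. */
final class NativeSplitSketch {
  final List<Integer> fragmentIds;

  NativeSplitSketch(List<Integer> fragmentIds) {
    this.fragmentIds = Collections.unmodifiableList(new ArrayList<>(fragmentIds));
  }
}

/** Hypothetical v1 descriptor; names are illustrative, not the PR's API. */
final class NativeScanDescriptorSketch {
  final String datasetUri;
  final long resolvedVersion;          // pinned so readers cannot drift to a newer snapshot
  final List<String> projectedColumns;
  final String pushedFilterSql;        // null when no filter was pushed
  final List<NativeSplitSketch> splits;
  final List<String> fallbackReasons;  // empty when the scan is representable in v1

  NativeScanDescriptorSketch(
      String datasetUri,
      long resolvedVersion,
      List<String> projectedColumns,
      String pushedFilterSql,
      List<NativeSplitSketch> splits,
      List<String> fallbackReasons) {
    this.datasetUri = datasetUri;
    this.resolvedVersion = resolvedVersion;
    this.projectedColumns = copy(projectedColumns);
    this.pushedFilterSql = pushedFilterSql;
    this.splits = copy(splits);
    this.fallbackReasons = copy(fallbackReasons);
  }

  /** A consumer would take the native path only when no fallback reason is recorded. */
  boolean isNativeSupported() {
    return fallbackReasons.isEmpty();
  }

  // Defensive copy so the descriptor stays immutable after construction.
  private static <T> List<T> copy(List<T> in) {
    return Collections.unmodifiableList(new ArrayList<>(in));
  }

  public static void main(String[] args) {
    NativeScanDescriptorSketch d = new NativeScanDescriptorSketch(
        "s3://bucket/table.lance", 7L, Arrays.asList("id", "name"), "id > 10",
        Arrays.asList(
            new NativeSplitSketch(Arrays.asList(0, 1)),
            new NativeSplitSketch(Arrays.asList(2))),
        Collections.<String>emptyList());
    System.out.println(d.isNativeSupported() + " splits=" + d.splits.size());
  }
}
```

Keeping the fallback reasons on the same object as the scan fields means a consumer needs only one check before deciding between native execution and the regular Spark path.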

The v1 scope is ordinary table reads only. Search/hybrid search, index-backed execution descriptors, aggregation pushdown, metadata/version columns, and namespace-backed credential refresh are not described natively in v1: scans that need them fall back, and native support for them is left as future work.

Why LanceScan carries the descriptor state

This PR adds several parameters to the LanceScan constructor because LanceScan is the object Spark keeps inside BatchScanExec after planning. A native consumer such as Comet sees that final BatchScanExec(scan = LanceScan) object, not the earlier LanceScanBuilder or catalog planning context. Therefore nativeScanPlan() needs the complete, already-resolved scan snapshot on LanceScan itself.

The goal is not to make the constructor a broad public API. The goal is to avoid asking native consumers to infer or recompute Lance Spark planning semantics from partial state. Re-planning later would be risky because it could reopen a different dataset version, use different storage options, produce different fragments, or miss fallback-only state such as pushed TopN/aggregation.

The added state is the minimum needed to describe or reject the native v1 scan accurately:

  • sparkReadSchema and schema keep the Spark-visible schema and projected read schema separate, which matters when Spark-facing fields differ from the physical/projection schema.
  • readOptions provides dataset URI, resolved version, batch size, table/catalog identifiers, and user storage options. The resolved version is required so native execution cannot drift to a newer Lance snapshot.
  • whereConditions, limit, and offset are serialized into the descriptor when v1 supports them.
  • topNSortOrders and pushedAggregation are carried even though v1 falls back for them, because the descriptor must reject those scans explicitly instead of silently dropping semantics.
  • pushedPredicates, zonemap stats, surviving fragment IDs, precomputed splits, and fragment row counts preserve the fragment-pruning and limit-pruning decisions Lance Spark already made on the driver.
  • activeShardingExpression and fragmentShardingKeys preserve the existing partitioning/reporting contract used by Lance Spark planning.
  • initialStorageOptions, namespaceImpl, and namespaceProperties preserve storage option precedence and the namespace context that workers/native readers need for the same dataset access path.
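
The explicit-rejection rule for TopN and aggregation can be sketched as a small check that collects fallback reasons instead of dropping the pushed state. The class name, method signature, and reason strings below are assumptions made for illustration and are not taken from the PR.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; the real check would inspect LanceScan's actual fields. */
final class FallbackCheckSketch {

  /**
   * Returns one reason per pushed feature that v1 cannot represent.
   * An empty list means the scan is eligible for the native descriptor.
   */
  static List<String> fallbackReasons(int topNSortOrderCount, boolean hasPushedAggregation) {
    List<String> reasons = new ArrayList<>();
    if (topNSortOrderCount > 0) {
      // Dropping the sort orders would silently change result ordering.
      reasons.add("pushed TopN is not representable in the v1 descriptor");
    }
    if (hasPushedAggregation) {
      // Dropping the aggregation would return raw rows instead of aggregates.
      reasons.add("pushed aggregation is not representable in the v1 descriptor");
    }
    return reasons;
  }

  public static void main(String[] args) {
    System.out.println(fallbackReasons(0, false)); // plain table read
    System.out.println(fallbackReasons(1, true));  // scan that must fall back
  }
}
```

Collecting every reason, rather than stopping at the first, would let a consumer log exactly why a scan stayed on the regular Spark path.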

A follow-up cleanup could wrap this constructor state into an internal immutable scan-state object if reviewers prefer that shape. This PR keeps the change direct so the descriptor contract and tests are easy to review first.
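
One possible shape for such a scan-state object is a small immutable class with a builder, sketched below. `ScanStateSketch` and its three fields are hypothetical and cover only a fraction of the state listed above; this is meant to show the pattern, not a proposed API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical immutable holder for resolved scan state. */
final class ScanStateSketch {
  final String datasetUri;
  final long resolvedVersion;
  final Map<String, String> storageOptions;

  private ScanStateSketch(Builder b) {
    this.datasetUri = b.datasetUri;
    this.resolvedVersion = b.resolvedVersion;
    // Copy, then wrap, so later builder mutations cannot leak into this object.
    this.storageOptions = Collections.unmodifiableMap(new HashMap<>(b.storageOptions));
  }

  static Builder builder(String datasetUri, long resolvedVersion) {
    return new Builder(datasetUri, resolvedVersion);
  }

  static final class Builder {
    private final String datasetUri;
    private final long resolvedVersion;
    private final Map<String, String> storageOptions = new HashMap<>();

    private Builder(String datasetUri, long resolvedVersion) {
      this.datasetUri = datasetUri;
      this.resolvedVersion = resolvedVersion;
    }

    Builder storageOption(String key, String value) {
      storageOptions.put(key, value);
      return this;
    }

    ScanStateSketch build() {
      return new ScanStateSketch(this);
    }
  }

  public static void main(String[] args) {
    ScanStateSketch state = ScanStateSketch.builder("file:///tmp/table.lance", 3L)
        .storageOption("region", "us-east-1")
        .build();
    System.out.println(state.datasetUri + "@" + state.resolvedVersion);
  }
}
```

With this shape the scan constructor would take a single state argument, and new descriptor inputs would be added to the builder rather than to the constructor signature.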

Testing

  • ./mvnw -pl lance-spark-base_2.12 -Dtest=LanceScanTest -Dspotless.skip=true test
  • ./mvnw -pl lance-spark-4.1_2.13 -am -Dtest=LanceScanTest -Dspotless.skip=true -Dsurefire.failIfNoSpecifiedTests=false test
  • ./mvnw -pl lance-spark-base_2.12 spotless:check

@github-actions

Contributor

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error, please inspect the "PR Title Check" action.

Development

Successfully merging this pull request may close these issues.

Expose a stable native scan descriptor for Lance Spark reads and datafusion-comet integration